Goto

Collaborating Authors

 cumulative regret


Provably Efficient Multi-Task Meta Bandit Learning via Shared Representations

Neural Information Processing Systems

Learning-to-learn or meta-learning focuses on developing algorithms that leverage prior experience to quickly acquire new skills or adapt to novel environments. A crucial component of meta-learning is representation learning, which aims to construct data representations capable of transferring knowledge across multiple tasks--a critical advantage in data-scarce settings. We study how representation learning can improve the efficiency of bandit problems. We consider T d-dimensional linear bandits that share a common low-dimensional linear representation. We provide provably fast, sample-efficient algorithms to address the two key problems in meta-learning: (1) learning a common set of features from multiple related bandit tasks and (2) transferring this knowledge to new, unseen bandit tasks.


Transfer Faster, Price Smarter: Minimax Dynamic Pricing under Cross-Market Preference Shift

Neural Information Processing Systems

We study contextual dynamic pricing when a target market can leverage Kauxiliary markets--offline logs or concurrent streams--whose mean utilities differ by a structured preference shift. We propose Cross-Market Transfer Dynamic Pricing (CM-TDP), the first algorithm that provably handles such model-shift transfer and delivers minimax-optimal regret for both linear and nonparametric utility models. For linear utilities of dimension d, where the difference between source-and targettask coefficients is s0-sparse, CM-TDP attains regret eO (dK 1 + s0) log T .


Thompson Sampling in Function Spaces via Neural Operators

Neural Information Processing Systems

We propose an extension of Thompson sampling to optimization problems over function spaces where the objective is a known functional of an unknown operator's output. We assume that queries to the operator (such as running a high-fidelity simulator or physical experiment) are costly, while functional evaluations on the operator's output are inexpensive. Our algorithm employs a sample-then-optimize approach using neural operator surrogates. This strategy avoids explicit uncertainty quantification by treating trained neural operators as approximate samples from a Gaussian process (GP) posterior. We derive regret bounds and theoretical results connecting neural operators with GPs in infinite-dimensional settings.


Learning Across the Gap: Hybrid Multi-armed Bandits with Heterogeneous Offline and Online Data

Neural Information Processing Systems

The multi-armed bandit (MAB) is a fundamental online decision-making framework that has been extensively studied over the past two decades. To mitigate the high cost and slow convergence of purely online learning, modern MAB approaches have explored hybrid paradigms that leverage offline data to warm-start online learning. However, existing approaches face a significant limitation by assuming that the offline and online data are homogeneous--they share the same feedback structure and are drawn from the same underlying distribution. This assumption is often violated in practice, where offline data often originate from diverse sources and evolving environments, resulting in feedback heterogeneity and distributional shifts. In this work, we tackle the challenge of learning across this offline-online gap by developing a general hybrid bandit framework that incorporates heterogeneous offline data to improve online performance. We study two hybrid settings: (1) using reward-based offline data to accelerate online learning in preference-based bandits (i.e., dueling bandits), and (2) using preference-based offline data to improve online standard MAB algorithms. For both settings, we design novel algorithms and derive tight regret bounds that match or improve upon existing benchmarks despite heterogeneity. Empirical evaluations on both synthetic and real-world datasets further show the superior performance of our proposed methods over baseline algorithms.


Tightening Regret Lower and Upper Bounds in Restless Rising Bandits

Neural Information Processing Systems

Restless Multi-Armed Bandits (MABs) are a general framework designed to handle real-world decision-making problems where the expected rewards evolve over time, such as in recommender systems and dynamic pricing. In this work, we investigate from a theoretical standpoint two well-known structured subclasses of restless MABs: the rising and the rising concave settings, where the expected reward of each arm evolves over time following an unknown non-decreasing and a non-decreasing concave function, respectively. By providing a novel methodology of independent interest for general restless bandits, we establish new lower bounds on the expected cumulative regret for both settings. In the rising case, we prove a lower bound of order ΩpT2{3q, matching known upper bounds for restless bandits; whereas, in the rising concave case, we derive a lower bound of order ΩpT3{5q, proving for the first time that this setting is provably more challenging than stationary MABs. Then, we introduce Rising Concave Budgeted Exploration (RC-BEpαq), a new regret minimization algorithm designed for the rising concave MABs. By devising a novel proof technique, we show that the expected cumulative regret of RC-BEpαq is in the order of rOpT7{11q. These results collectively make a step towards closing the gap in rising concave MABs, positioning them between stationary and general restless bandit settings in terms of statistical complexity.


Tractable Multinomial Logit Contextual Bandits with Non-Linear Utilities

Neural Information Processing Systems

We study the multinomial logit (MNL) contextual bandit problem for sequential assortment selection. Although most existing research assumes utility functions to be linear in item features, this linearity assumption restricts the modeling of intricate interactions between items and user preferences. A recent work [41] has investigated general utility function classes, yet its method faces fundamental tradeoffs between computational tractability and statistical efficiency. To address this limitation, we propose a computationally efficient algorithm for MNL contextual bandits leveraging the upper confidence bound principle, specifically designed for non-linear parametric utility functions, including those modeled by neural networks. Under a realizability assumption and a mild geometric condition on the utility function class, our algorithm achieves a regret bound of eO( T), where T denotes the total number of rounds. Our result establishes that sharp eO( T)-regret is attainable even with neural network-based utilities, without relying on strong assumptions such as neural tangent kernel approximations. To the best of our knowledge, our proposed method is the first computationally tractable algorithm for MNL contextual bandits with non-linear utilities that provably attains eO( T) regret.


Learning to target with network interference

arXiv.org Machine Learning

This paper studies adaptive targeting under network interference in a bandit setting, where treatments applied to one individual may affect others through spillover effects. We consider a linear model in a sparse regime, where each individual's outcome can be affected by at most a few others. We first establish a regret lower bound showing that ignoring the network structure and reducing the problem to a standard linear bandit inevitably leads to inefficient learning, particularly in large populations. To understand how structural information can be leveraged, we analyze regimes with varying levels of knowledge of the interference structure: (1) full support knowledge, (2) knowledge of the column support sizes, and (3) no prior knowledge. For each regime, we establish regret lower bounds characterizing the fundamental limits of learning, and develop algorithms that achieve near-optimal regret. Together, our results provide a unified view of how knowledge of the interference structure governs the efficiency of online learning under interference, and offer practical adaptive targeting algorithms in each setting. Numerical experiments on synthetic and real-world data demonstrate the practical benefits of our algorithms.


Spectral bandits for smooth graph functions with applications in recommender systems

arXiv.org Machine Learning

Smooth functions on graphs have wide applications in manifold and semi-supervised learning. In this paper, we study a bandit problem where the payoffs of arms are smooth on a graph. This framework is suitable for solving online learning problems that involve graphs, such as content-based recommendation. In this problem, each recommended item is a node and its expected rating is similar to its neighbors. The goal is to recommend items that have high expected ratings. We aim for the algorithms where the cumulative regret would not scale poorly with the number of nodes. In particular, we introduce the notion of an effective dimension, which is small in real-world graphs, and propose two algorithms for solving our problem that scale linearly in this dimension. Our experiments on real-world content recommendation problem show that a good estimator of user preferences for thousands of items can be learned from just tens nodes evaluations.


Adaptive Policy Learning Under Unknown Network Interference

arXiv.org Machine Learning

Adaptive experimentation under unknown network interference requires solving two coupled problems: (i) learning the underlying dynamics of interference among units and (ii) using these dynamics to inform treatment allocation in order to maximize a cumulative outcome of interest (e.g. revenue). Existing adaptive experimentation methods either assume the interference network is fully known or bypass the network by operating on coarse cluster-level randomizations. We develop a Thompson sampling algorithm that jointly learns the interference network and adaptively optimizes individual-level treatment allocations via a Gibbs sampler. The algorithm returns both an optimized treatment policy and an estimate of the interference network; the latter supports downstream causal analyses such as estimation of direct, indirect, and total treatment effects. For additive spillover models, we show that total reward is linear in the treatment vector with coefficients given by an $n$-dimensional latent score. We prove a Bayesian regret bound of order $\sqrt{nT \cdot B \log(en/B)}$ for exact posterior sampling; empirically, our Gibbs-based approximate sampler achieves regret consistent with this rate and remains sublinear when the additive spillovers assumption is violated. For general Neighborhood Interference, where this reduction is unavailable, we analyze an explore-then-commit variant with $O(n^2 \log T)$ graph-discovery cost. An information-theoretic $Ω(n \log T)$ lower bound complements both results. Empirically, our method achieves more than an order-of-magnitude reduction in regret in head-to-head comparisons. On two real-world networks, the algorithm achieves sublinear regret and yields downstream effect estimates with small RMSE relative to the truth.


Tight Generalization Bounds for Noiseless Inverse Optimization

arXiv.org Machine Learning

Inverse optimization (IO) seeks to infer the parameters of a decision-maker's objective from observed context--action data. We study noiseless IO, where demonstrations are generated by a ground-truth objective. We provide a high-probability ${O}(\frac{d}{T})$ generalization bound for the induced action set, where $d$ is the number of unknown parameters and $T$ is the size of the training dataset. We strengthen these guarantees under additional conditions that ensure uniqueness of the chosen action, bringing our IO guarantees in line with best-arm identification results in the bandit literature. We further show that the ${O}(\frac{d}{T})$ rate is tight over all consistent estimators considered here, and extend the result to both instantaneous and cumulative regret. Notably, the resulting regret lower bound matches the corresponding upper bounds in the adversarial setting, indicating that the stochastic IO setting is effectively adversarial for the class of estimators studied here. Finally, we propose a parameter-free algorithm with lower per-iteration complexity than generic solvers. Experiments validate the predicted rates and illustrate the tightness of our bounds.